Useful tools for easy exploratory data analysis (EDA)
Off-the-shelf, simple functions for data analysis
- pandas_profiling
- sweetviz
- resumetable
- feature_transform (my library)
I will explore the transformations using the red wine quality dataset from Kaggle.
from pathlib import Path
import pandas as pd
#import numpy as np
#from scipy.stats import kurtosis, skew
from scipy import stats
# import math
# import warnings
# warnings.filterwarnings("error")
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'redwine'
csv_path = base_dir + '/winequality-red.csv'
df = pd.read_csv(csv_path)
# https://gist.github.com/harperfu6/5ea565ee23aaf8461a840c480490cd9a
pd.set_option("display.max_rows", 1000)
def resumetable(df):
    print(f'Dataset Shape: {df.shape}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = \
            round(stats.entropy(df[name].value_counts(normalize=True), base=2), 2)
    return summary
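As a quick sanity check on the Entropy column, here is what stats.entropy returns for two toy columns (my own illustration, not part of the original notebook): a constant column carries zero bits, while a column uniform over k values carries log2(k) bits.

```python
import pandas as pd
from scipy import stats

# A constant column has zero entropy.
constant = pd.Series([1, 1, 1, 1])
# A column uniform over 4 distinct values has log2(4) = 2 bits.
uniform = pd.Series([1, 2, 3, 4])

print(round(stats.entropy(constant.value_counts(normalize=True), base=2), 2))  # 0.0
print(round(stats.entropy(uniform.value_counts(normalize=True), base=2), 2))   # 2.0
```

High entropy flags high-cardinality columns (often IDs or continuous measurements), low entropy flags near-constant ones.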
Typically, the first step is to examine the first few rows of the data, but this only gives a very rudimentary feel for it.
df = pd.read_csv(csv_path)
df.head()
I found resumetable() very convenient: the Uniques column gives a sense of cardinality, and missing data is easy to spot.
Knowing the dtype of each column is also helpful when it comes to pre-processing the data.
I came across this function on Kaggle (I think) and have found it incredibly useful.
resumetable(df)
Another tool I use is pandas_profiling.
import sys
!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension
from ipywidgets import widgets
# Our package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
profile = ProfileReport(df, title="red wine", html={"style": {"full_width": True}}, sort="None")
It takes a couple of minutes to process and display the results, even with a small dataset.
In return, you get richer analysis, such as correlation plots and per-variable distributions.
profile.to_widgets()
An alternative is Sweetviz. I tend to like it a bit better for its display of distributions, and it generally loads faster too.
!pip -q install sweetviz
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_notebook(w=1200)
I wanted a simple way to view the distributions of the features and, more importantly, to view the data after numerical transformations such as Box-Cox or a log transform.
The following plot is a sample of what I developed.
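The idea can be sketched as follows. This helper is my own illustration, not the feature_transform API: it plots a positive-valued feature's raw, log1p, and Box-Cox distributions side by side using matplotlib and scipy.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_transforms(series, name="feature"):
    """Plot raw, log1p, and Box-Cox views of a strictly positive series."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].hist(series, bins=30)
    axes[0].set_title(f"{name} (raw)")
    axes[1].hist(np.log1p(series), bins=30)
    axes[1].set_title(f"{name} (log1p)")
    # Box-Cox requires strictly positive values; it also estimates lambda.
    boxcox_vals, lam = stats.boxcox(series)
    axes[2].hist(boxcox_vals, bins=30)
    axes[2].set_title(f"{name} (Box-Cox, lambda={lam:.2f})")
    fig.tight_layout()
    return fig

# Example usage with the wine data (column name from the Kaggle CSV):
# plot_transforms(df["fixed acidity"], "fixed acidity")
```

Comparing the three panels side by side makes it easy to judge which transform best symmetrizes a skewed feature.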
